Multi-task deep cross-attention networks for far-field speaker verification and keyword spotting

Abstract

Personalized voice triggering is a key technology in voice assistants and serves as the first step for users to activate the assistant. It involves keyword spotting (KWS) and speaker verification (SV). Conventional approaches to this task develop the KWS and SV systems separately. This paper proposes a single system, called the multi-task deep cross-attention network (MTCANet), that performs KWS and SV simultaneously while effectively utilizing information relevant to both tasks. The proposed framework integrates a KWS sub-network and an SV sub-network to enhance performance under challenging conditions such as noisy environments, short-duration speech, and model generalization. At the core of MTCANet are three modules: a novel deep cross-attention (DCA) module to integrate the KWS and SV tasks, a multi-layer stacked shared encoder (SE) to reduce the impact of noise on the recognition rate, and soft attention (SA) modules that allow the model to focus on pertinent information in the middle layers while preventing gradient vanishing. The proposed model demonstrates outstanding performance on the well-off test set, improving by 0.2%, 0.023, and 2.28% over the well-known SV model ECAPA-TDNN (emphasized channel attention, propagation, and aggregation in time delay neural network) and the advanced KWS model Convmixer in terms of equal error rate (EER), minimum detection cost function (minDCF), and accuracy (Acc), respectively.
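The abstract reports results as EER and minDCF without defining them. As a rough illustration only, the sketch below shows one common way these two speaker-verification metrics can be computed from trial scores. It is not the authors' evaluation code: the function names, the toy scores, and the cost parameters (p_target=0.01, c_miss=1, c_fa=1) are assumptions chosen for the example, and the paper's actual settings are not stated here.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Return (FAR, FRR) when a trial is accepted if score >= threshold."""
    far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    return far, frr


def candidate_thresholds(target_scores, nontarget_scores):
    # Every observed score, plus +inf so that "reject everything" is covered.
    return sorted(set(target_scores) | set(nontarget_scores)) + [float("inf")]


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the operating point where FAR and FRR are closest."""
    best_gap, best_eer = None, None
    for t in candidate_thresholds(target_scores, nontarget_scores):
        far, frr = error_rates(target_scores, nontarget_scores, t)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def min_dcf(target_scores, nontarget_scores,
            p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalized minimum detection cost (parameter defaults are assumed)."""
    # Cost of the best trivial system (always accept or always reject).
    default_cost = min(c_miss * p_target, c_fa * (1 - p_target))
    costs = []
    for t in candidate_thresholds(target_scores, nontarget_scores):
        far, frr = error_rates(target_scores, nontarget_scores, t)
        costs.append(c_miss * p_target * frr + c_fa * (1 - p_target) * far)
    return min(costs) / default_cost


# Toy scores (hypothetical): higher means "more likely the enrolled speaker".
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
print("EER:", equal_error_rate(targets, nontargets))
print("minDCF:", min_dcf(targets, nontargets))
```

Lower is better for both metrics. The threshold sweep here is quadratic in the number of trials, which is fine for a small example; large evaluations usually sort the scores once and accumulate the error counts.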

Similar resources

Multi-Task Learning and Weighted Cross-Entropy for DNN-Based Keyword Spotting

We propose improved Deep Neural Network (DNN) training loss functions for more accurate single keyword spotting on resource-constrained embedded devices. The loss function modifications consist of a combination of multi-task training and weighted cross-entropy. In the multi-task architecture, the keyword DNN acoustic model is trained with two tasks in parallel: the main task of predicting the ke...

Transferable Deep Features for Keyword Spotting

Deep features, defined as the activations of hidden layers of a neural network, have given promising results when applied to various vision tasks. In this paper, we explore the usefulness and transferability of deep features in the context of keyword spotting (KWS). We use a state-of-the-art deep convolutional network to extract deep features. The optimal parameters concerning...

Confidence Measure for Utterance Verification in Keyword Spotting System

In this article, we propose an utterance verification technique for keyword spotting. The keyword spotting system analyzes given spoken content and searches for every speech segment in which one of the pre-defined keywords is uttered. To maintain stable recognition performance in the system, we propose an utterance verification technique that verifies whether a found utterance, or a candidate keywo...

Multi-task learning for text-dependent speaker verification

Text-dependent speaker verification uses short utterances and verifies both speaker identity and text content. Because of this, traditional state-of-the-art speaker verification approaches, such as i-vector, may not work well. Recently, there has been interest in applying deep learning to speaker verification; however, in previous works, standalone deep learning systems have not achieved sta...

Spoken keyword spotting via multi-lattice alignment

We propose a method for finding keywords in an audio database using a spoken query. Our method is based on performing a joint alignment between a phone lattice generated from a spoken utterance query and a second phone lattice representing a long utterance needing to be searched. We implement this joint alignment procedure in a graphical models framework. We evaluate our system on TIMIT as well...


Journal

Journal title: EURASIP Journal on Audio, Speech, and Music Processing

Year: 2023

ISSN: 1687-4722, 1687-4714

DOI: https://doi.org/10.1186/s13636-023-00293-8